1. Introduction

This Quarterly Technical Report, Number 2, describes aspects
of our work performed under Contract No. MDA903-78-C-0356 during
the period from 1 November 1978 to 31 January 1979. This is the
second in a series of Quarterly Technical Reports on the design
of a packet speech concentrator, the Voice Funnel.

Most of our effort during this quarter has concentrated on
the development of a firm framework for supporting the Voice
Funnel software, including both the elaboration of the Butterfly
Multiprocessor hardware and the design of a suitable operating
system for the machine, which will ease software development in a
multiprocessor environment. In the following sections we present
some of the results of this work.

2. Operating System Development

During  the  period  covered  by  this  report  we  have  spent 
considerable  effort  developing  the  basic  design  of  the  operating 
system  for  the  Voice  Funnel.  The  operating  system  provides  a 
user  environment  which  attempts  to  insulate  the  application  from 
the  raw  hardware,  while  retaining  access  to  the  capabilities 
provided  by  the  hardware  and  not  imposing  much  overhead.  The 
operating  system  attempts  to  support  only  the  application;  it  is 
not  a general  purpose  facility  and  does  not  attempt  to  support 
program  development.  We  will  describe  some  of  what  we  have 
learned  about  the  appropriate  design  for  the  operating  system  of 
the Voice Funnel on the Butterfly Multiprocessor.

2.1 Operating System Overview

Experience  in  a variety  of  situations  has  convinced  us  that 
we  should  separate  resource  management  and  problem-specific 
concerns  in  the  Voice  Funnel  software.  Having  built  systems  with 
and  without  such  a separation,  we  are  painfully  aware  that  when 
no such separation exists, several significant and costly
problems  arise.  Among  the  most  important  of  these  are  that: 

- problem-related  algorithms  become  cluttered  with  details 
irrelevant  to  the  problem  at  hand; 

- development and maintenance proceed much more slowly than
the intrinsic difficulty of a task would suggest;

- adaptation to changes in strategy is hindered by the
inflexibility of the general structure; and

- errors frequently occur which are not directly related to
the problem being solved.

With  such  strong  arguments  in  favor  of  this  separation,  it 
may  seem  surprising  that  the  alternative  is  ever  considered. 
There  are,  of  course,  dangers,  both  apparent  and  real,  in 
creating  such  a separation.  The  separation  is  typically  achieved 
by  building  an  operating  system.  An  operating  system  is 
seductive and, if not controlled, it can grow to consume far too
much of the personnel, memory, and processor bandwidth available
to a project. Also, because it is a separately identified
component, an operating system can appear to represent additional
or unnecessary work. Of the two concerns, the latter is more
apparent than real, since what really occurs is that work which
would otherwise be distributed throughout the software is instead
consolidated and shifted from one area into another. The concern
about the operating system task growing out of control is real,
however,  and  must  be  carefully  managed. 

The approach we propose to take is to build a limited, but
potent and extensible, operating system which is sufficient to
meet the perceived requirements of the Voice Funnel and which is
flexible enough to adapt to changes in those requirements and to
the emergence of any future requirements which can reasonably be
anticipated. We have begun this task by reviewing those
capabilities which have become common in operating systems
[BRIN 73] and by examining any special requirements which are
imposed  by  the  Voice  Funnel  task  or  the  Butterfly  Multiprocessor 
hardware.  We  have  attempted  to  select  those  capabilities  needed 
in  one  form  or  another  for  the  Voice  Funnel  and  to  design  an 
operating  system  which  provides  those  capabilities  while  showing 
promise  of  being  extended  to  meet  additional  requirements,  should 
they  arise. 

Throughout  the  design  of  both  the  Voice  Funnel  and  the 
operating  system  software,  we  have  paid  considerable  attention  to 
the  choice  of  services  to  be  provided  by  the  operating  system, 
for it is not the number or complexity of the operating system
features which matters, but rather their simplicity, power, and
cost. Time spent in reducing the overall conception to a small
set of elementary, powerful, and inexpensive facilities will pay
off handsomely in reduced time to develop both the application
and  the  operating  system,  in  reduced  consumption  of  processor 
cycles,  and  in  increased  effective  utilization  of  personnel  and 
hardware. 

Three  concepts  in  particular  have  guided  us  in  pursuing 
these  objectives.  We  have  tried  to  be  conscious  of  what  the 
programmer can do himself as easily as can be done automatically.
We have tried to avoid building attractive facilities for which
we cannot see a clear justification in the current application.
Finally, we have taken advantage of the fact that the Voice
Funnel is a dedicated application. This will allow us to avoid
much of the checking and enforcement which would be required in
an environment which had to run arbitrary user programs.
Instead, we will be able to limit checking to those conditions
which might be expected to arise from routine hardware or
software failures rather than from malicious behavior or outright
negligence.


The  principal  facilities  of  the  proposed  operating  system 
are  briefly  described  in  the  following  paragraphs. 

- Processor Management (or Scheduling): Allows a large
number of processes* to share one or more processors
without requiring any of them to know about multiple
processors, about which processor they might run on, or
about what to do with time they are not using. Assigns
work to processors, shares available processor time based
on a mix of priority and fairness considerations, and
attempts to maintain a more or less level load on the
various processors. Permits processes to run concurrently
(i.e., overlap in time), and permits them to operate
asynchronously (by buffering or queuing work or data
passing between them).

- Memory Management: Ensures that each process has
transparent access to the memory assigned to it, allows
protection of the private memory (e.g., instructions and
stack) of any process from any other process, and allows
voluntary sharing of memory by cooperating processes.
Relieves each process of the task of explicitly managing
the memory mapping registers, and protects each from
incorrect use of the registers by the others.


* We will use process in its customary computer science sense to
refer to the various independent (and perhaps cooperating)
software components (or programs) which may be given control of a
processor by a scheduler. We will use task (which some computer
scientists have used in place of process) in its everyday sense
to mean a piece of work to be done.

- Inter-process Communication: Facilitates the transfer of
messages or data from one process to another, either
through shared memory or by copying data from one address
space to the other. Provides simple send and receive
mechanisms which relieve the user of the need to perform
mutual exclusion, memory management, and scheduling
operations directly.

- Process Synchronization: Provides for coordinated use of
shared data, preventing conflicting access or modification
by multiple processes. Provides for coordination between
processes awaiting or causing common events.

- Interrupt Handling: Provides a common mechanism for
servicing interrupts and for utilizing them to keep work
flowing briskly through the system.

- Reliability and Availability: Provides mechanisms for
(1) periodically assessing the health of all hardware, all
software and all critical data bases; (2) identifying and
removing from use any failed components; and
(3) realigning the remaining components to continue
operation at reduced capacity (graceful degradation).
Presents a reliable framework within which the application
can run without itself having to attend to all of the
details.
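As an illustration of the send and receive facility described in the list above, the following is a minimal sketch in Python, not the Voice Funnel design. The Mailbox class, its capacity parameter, and the use of threads in place of Butterfly processes are all assumptions made for the example. The point it shows is that the caller performs no locking of its own; mutual exclusion and blocking are hidden behind the two calls.

```python
import queue
import threading


class Mailbox:
    """Hypothetical message port: callers never handle a lock directly."""

    def __init__(self, capacity=8):
        # queue.Queue performs its own locking and blocking internally.
        self._q = queue.Queue(maxsize=capacity)

    def send(self, message):
        self._q.put(message)      # blocks while the mailbox is full

    def receive(self):
        return self._q.get()      # blocks until a message arrives


def run_demo(n_messages=5):
    """One sender and one receiver exchange messages through a Mailbox."""
    box = Mailbox()
    received = []

    def producer():
        for i in range(n_messages):
            box.send(("speech-packet", i))
        box.send(None)            # sentinel: nothing more to send

    def consumer():
        while True:
            msg = box.receive()
            if msg is None:
                break
            received.append(msg)

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

With a single sender, messages arrive in the order sent, so run_demo(3) returns the three packets numbered 0, 1, 2.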


A framework  such  as  we  have  described  presents  a number  of 
important  advantages.  The  Voice  Funnel  software  will  be 
organized  by  function,  with  capabilities  that  are  required  at 
many points being concentrated into a small number of elementary
and powerful functions which can be built and tested once and
then used repeatedly. Problems intrinsic to the application will
be clearly separated from problems of taming the environment,
with the result that solutions of both types can be developed and
expressed independently. This will yield simpler and more
effective solutions in both areas. It will make it easier to
experiment with new algorithms or to incorporate new
requirements, since the number of problems requiring simultaneous
solutions  will  be  reduced.  It  will  also  increase  the  safety  of 
critical  data,  since  the  data  will  be  handled  in  fewer  places. 
While  there  will  certainly  be  some  cases  in  which  this  solution 
will  be  less  efficient  than  one  in  which  the  application  handles 
all  resource  problems  in  line,  careful  attention  to  the  potential 
dangers  of  this  approach  will  enable  us  to  minimize  them.  The 
gains  realized  through  more  effective  global  control  and 
optimization,  through  greater  flexibility  and  adaptability,  and 
through  greater  reliability,  will  yield  benefits  much  greater 
than  the  occasional  inefficiencies  incurred. 

2.2 Exploiting Parallelism

Numerous workers have attempted to achieve parallelism at
various levels in the program decomposition hierarchy [BAER 73].
This has been attempted at the following levels, among others:

- at the program level (which has occurred most often),

- at the procedure level (for example, PL/I for the IBM
370),

- at the statement level (for example, the parbegin/parend
construction of Dijkstra [DIJK 65], or the ALGOL 68
compiler for CMU's C.mmp multiprocessor [KNUE 76]),

- at the sub-expression level (primarily in theoretical
work), and

- at the instruction level (for example, the CDC 6600
[THOR 64]).

Of course, at each level, there may be sequences of objects
(e.g., a group of statements which must be executed sequentially)
for which ordering must be preserved.

For various reasons we have chosen to focus on the program
level in providing parallelism within the Voice Funnel. The
application contains enough parallelism at this level to take
effective advantage of our multiprocessor architecture. In
addition, this approach provides good control over locality of
memory references, which greatly influences the level of hardware
efficiency achieved. Finally, by using this approach, we can use
an existing compiler (with a new Z8000 code generator), rather
than having to write a new compiler or undertake major structural
changes to an existing one.

Adopting the second approach (procedure level parallelism)
would require modifying the compiler to use some mechanism other
than a stack for automatic variables, subroutine linkage, and
temporary storage, since multiple processors could not share a
stack. In addition, whenever multiple processors were applied to
procedures within a program, any data inherited from the calling
procedure, and perhaps the instructions, would be remote to the
new processors. For the Butterfly Multiprocessor, this might
have significant adverse effects on execution efficiency if the
ratio of remote to local references became too large. Finally,
the task of writing programs to utilize this degree of
parallelism would be more difficult.

The third approach (statement level parallelism) would have
all  of  the  difficulties  associated  with  the  procedure  level 
approach  and  would  place  additional  burdens  on  the  programmer  and 
the  compiler.  At  this  level,  there  are  two  principal 
sub-approaches  available:  having  the  compiler  recognize  and 
exploit  opportunities  for  parallelism,  and  having  the  programmer 
do  it  [DIJK  65].  The  former  sub-approach  obviously  requires  both 
a clever compiler and a clever run-time system [KNUE 76]. Both
approaches, however, require that storage management similar to
that required for parallel procedures be employed wherever the
compiler or the programmer introduces parallelism (i.e., much
more often than would otherwise occur). Programming at this
level  would  be  much  more  difficult  than  at  earlier  levels  because 
the  programmer  would  have  to  write  true  multiprocessor  programs, 
rather  than  cooperating  uniprocessor  programs  or  procedures. 

Work on the fourth approach (sub-expression parallelism) has
been largely theoretical [BAER 73] and is not of interest here.
Obviously, it requires a sophisticated compiler, and it may well
require special processor capabilities. Because the burden of
detecting and exploiting parallelism would be shifted from the
programmer to the compiler and/or the processor, the programming
task would be easier than for the previous approach.

The fifth approach is not really a multiprocessor approach.
Instead, it has been applied to large uniprocessors with multiple
function units capable of executing several instructions
simultaneously.  Like  the  fourth  approach,  it  is  not  of  real 
interest  here. 

A second major choice, in addition to selecting the level at
which to apply parallelism, concerns the control mechanisms to be
employed for switching control between parallel streams. In
particular, it concerns whether the system scheduler should be
involved in all decisions to switch control, or whether some
control switching decisions might be made more efficiently by
closely cooperating streams within an application. The former
approach has usually been taken, and in such cases the parallel
streams have normally been called processes. More recently,
certain workers [KNUE 76] have suggested another level of
scheduling in which each process might be allocated one or more
processors which it would then switch amongst various internal
activities using much simpler mechanisms than a scheduler might
use.


We consider the concept of scheduling activities to be a
very intriguing area for further study. However, we will not use
it for the Voice Funnel because it seems most appropriate when
parallelism occurs at the statement level or the procedure level
and when the frequency of non-local references does not strongly

A quite different attempt at eliminating the scheduler was
used in the Pluribus Multiprocessor [KATS 78]. In the Pluribus,
programs are written as strips. Strips have two very important
properties: they must complete within a very limited time
(determined by device latency requirements), and they may not
preserve any private context when they finish (since they have no
context, context switching is avoided). Work to be done by the
strips consists of either I/O device service or work that has
been queued internally. Each device and each queue has a
priority. Extremely fast dispatching is provided by a hardware
unit from which a processor which has completed a strip can
determine the identity of the highest priority device or queue
requiring service. Despite the advantages of rapid dispatching
and minimal context switching, the Pluribus approach has proven
less flexible and more difficult to program than the other
approaches considered. In fact, recent work on the Pluribus has
involved superimposing a process mechanism upon the strip
mechanism.
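The strip discipline can be illustrated with a short Python sketch. This is not Pluribus code: the Dispatcher class, the convention that a lower number means a higher priority, and the use of a software heap in place of the hardware dispatching unit are all assumptions made for illustration. Each strip runs to completion, keeps no private context, and hands any follow-on work back to the dispatcher as a new request.

```python
import heapq
import itertools


class Dispatcher:
    """Software stand-in (an assumption) for the hardware priority unit."""

    def __init__(self):
        self._heap = []
        # The counter breaks ties so equal priorities are served first-in first-out.
        self._seq = itertools.count()

    def request(self, priority, strip, arg):
        # Lower number means higher priority (a convention chosen for this sketch).
        heapq.heappush(self._heap, (priority, next(self._seq), strip, arg))

    def run(self):
        while self._heap:
            _, _, strip, arg = heapq.heappop(self._heap)
            strip(self, arg)      # a strip always runs to completion


log = []


def input_strip(dispatcher, packet):
    log.append(("input", packet))
    # No context survives the strip; follow-on work is queued explicitly.
    dispatcher.request(2, output_strip, packet)


def output_strip(dispatcher, packet):
    log.append(("output", packet))


def demo():
    log.clear()
    d = Dispatcher()
    d.request(1, input_strip, "a")
    d.request(1, input_strip, "b")
    d.run()
    return list(log)
```

In demo(), both input strips run before either output strip because input requests carry the higher priority. The sketch also hints at the programming difficulty noted above: any state a computation needs across strips must be carried explicitly in the queued argument.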

Much of the foregoing has been independent of the specific
characteristics  of  the  Voice  Funnel.  It  would,  therefore,  be 
appropriate  to  consider  why  the  choices  made  thus  far  are 
sensible  for  that  application. 

The  Voice  Funnel  will  comprise  a large  number  of  independent 
data streams going to or from various speech terminals. Each
data stream will comprise a large number of items (control
requests and speech packets). Each item will pass through a
number of processing stages (speech terminal input, one or more
multiplexing stages, output to the PSAT, and the reverse), both
as an individual item and as a member of an aggregate. While
end-to-end sequencing within data streams must be observed,
sequencing between data streams need not be.

The  work  to  be  done  by  the  Voice  Funnel  thus  breaks  down 
naturally  into  a number  of  tasks  corresponding  to  the  processing 
of  a single  item  or  aggregate  of  items  at  a particular  stage. 
Except  that  any  item  must  move  through  its  stages  in  sequence, 
and  except  that  items  from  the  same  data  stream  must  be  delivered 
in  sequence  at  their  destination,  it  will  generally  be  possible 
to  perform  tasks  in  parallel.  Moreover,  because  processing 
stages  will  typically  be  very  small,  further  decomposition  of 
tasks  will  not  usually  be  attractive  because  even  negligible 
overhead  would  become  large  as  task  size  became  very  small. 

Given this breakdown of the work, it is natural to view the
application software as comprising a number of programs each of
which will be able to perform one of the tasks which make up the
application. Because we have initially chosen to allow
parallelism only at the program level, and because we do not
intend to allow processes to perform sub-scheduling (i.e., we
will not utilize activities), each program servicing a task will
be a process.

In order to achieve parallelism between data streams, it
will  be  necessary  to  process  several  tasks  of  the  same  type 
simultaneously  on  a number  of  processors.  This  will  require 
several  instances  of  each  process  for  which  parallelism  is 
desired,  since  only  one  processor  at  a time  can  run  a single 
instance  of  a process.*  While  all  such  instances  might  share  a 
copy of the code, processor utilization will be much more
efficient if code is always local to the processor on which it
runs. Therefore, each processor which will run a process will
have a copy of the code. Since each instance of a process must
have its own stack and private data, these will also be
replicated, and for efficiency reasons will also be local to the
processor on which they will run. By making multiple copies of
all  processes,  whether  or  not  their  instances  will  actually  run 
in  parallel,  we  will  provide  the  scheduler  with  more  choices  as 
to  where  to  run  each  process,  and  we  will  provide  redundancy  as 
protection  against  the  loss  of  any  single  processor. 

* Where it is necessary to distinguish between them, we will
refer to the individual instances of a process as instances or as
specific processes, and we will refer to the process in general
as the generic process. Similarly, we will refer to individual
occurrences of a task as occurrences or as specific tasks and to
the task in general as a generic task. Recall that processes are
instruction streams and that tasks are units of work.

Most Voice Funnel processes will be created when the
application starts and will cycle indefinitely, processing tasks
successively as they arrive for service. In addition, it will be
possible to create or remove processes at any time it is
desirable to do so. We have chosen to create processes in
anticipation of need, rather than creating processes on demand
(as would be done for a time-sharing system), because we know in
advance which tasks will occur and that they will recur at
frequent intervals. We also do this because the Voice Funnel
cannot afford either the delay from loading code or the delay
from allocating a stack and private data and updating the
scheduling tables each time a task arrives for service. Because
the most commonly occurring tasks will be very brief and will be
performed without any intermediate waits, there will be no
advantage to interleaving the execution of two instances of one
process on the same processor.
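The arrangement of pre-created, cycling instances can be sketched in Python as follows. This is an illustration under stated assumptions, not the Voice Funnel scheduler: threads stand in for process instances, the function names are hypothetical, and the rule that each data stream is always served by the same instance is one simple way (assumed here) to keep items of a stream in sequence while different streams proceed in parallel.

```python
import queue
import threading


def start_instances(n_instances, handler):
    """Create every instance up front; each cycles until it is sent None."""
    inboxes = [queue.Queue() for _ in range(n_instances)]
    results = [[] for _ in range(n_instances)]

    def cycle(idx):
        # No per-task setup cost: stack and private data already exist.
        while True:
            task = inboxes[idx].get()
            if task is None:
                break
            results[idx].append(handler(task))

    threads = [threading.Thread(target=cycle, args=(i,))
               for i in range(n_instances)]
    for t in threads:
        t.start()
    return inboxes, results, threads


def demo():
    inboxes, results, threads = start_instances(2, lambda task: task)
    for seq in range(3):
        for stream in range(4):
            # Assumption: a stream is pinned to one instance to preserve order.
            inboxes[stream % 2].put((stream, seq))
    for box in inboxes:
        box.put(None)             # stop the cycling instances
    for t in threads:
        t.join()
    return results
```

In demo(), streams 0 and 2 are handled by the first instance and streams 1 and 3 by the second; within each stream the sequence numbers come out in order.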


Because  the  Voice  Funnel  will  be  a dedicated  application,  we 
have  endeavored  to  avoid  requiring  that  the  operating  system 
contain facilities which would add unnecessary overhead or delay
to Voice Funnel operation. On the other hand, we have attempted
to define operating system facilities which can be provided in a
manner which will permit their use in other situations. Thus,
while we anticipate that Voice Funnel processes will not have to
share the system, alternative applications (such as a
time-sharing system or another dedicated application) could use
the same facilities (perhaps with elaboration) to accomplish
different ends. For example, if it suited a particular application,
processes might terminate voluntarily rather than cycling,
multiple copies might not be loaded, and processes might be
loaded only when needed.

3. Switch Development

While the basic design of the Butterfly Switch has been
previously developed and described [BBN 78], many of the
parameters of the Butterfly Switch have been closely examined in
the past quarter. These variations on the basic Butterfly Switch
should have a major impact on the performance of the Butterfly
Multiprocessor.

This section presents some of the more interesting
variations of the Butterfly Switch which we have explored. After
this presentation, we will summarize the switch as we would
implement it now, although further development may cause the
actual implementation to differ from this definition. The design
of a Butterfly Switch is complex enough that many of the topics
interact. We therefore apologize in advance for discussing some
topics before they have been properly introduced.

3.1  Parallel  Data  Paths 

Clearly, the bandwidth of the Butterfly Switch is going to
be an important parameter of a Butterfly Multiprocessor. The
simplest way to improve this bandwidth is through parallelism in
the data paths of the switch. This also has the advantage of
amortizing the overhead of the control portion of the switch:
the control logic in an MSI implementation, and the control pins
in an LSI implementation. At times, we will use the term
thickness as a synonym for data parallelism when referring to the
dimensions of a Butterfly Switch.

It  is  not  necessary  to  go  to  extremes,  when  introducing 
parallelism,  by  designing  switches  which  permit  the  transaction 
to  be  only  one  tick  long;  a small  amount  of  parallelism  (such  as 
2 or 8 bits parallel) produces a switch in which one path has a
bandwidth of many tens of megabits. Large amounts of parallelism
become unattractive when the marginal system advantage of an
extra data path becomes less than the marginal increase in the
cost of the switch.

3.2 Switch Node Base

Previous descriptions of the Butterfly Switch have assumed
that the switch nodes have two inputs and two outputs. This
choice is somewhat arbitrary; in fact, a switch node with any
number of inputs and outputs (greater than 1) could be used.
Indeed, we feel that a switch node with 4 inputs and 4 outputs is
much better because it reduces the number of switch nodes by a
factor of 4 and the number of interconnections by a factor of 2.
A switch of base 8 would reduce the number of nodes
further. However, as we will see, large bases lead to less
modular switch sizes and larger switch node implementations. A
choice of 4 for the base seems a good compromise.

In general, a switch with N ports built from base-B nodes
contains the following number of switch nodes:

(N/B) * LOG[Base B](N)

Unfortunately, for bases other than two, the easily
recognizable structure of the Fast Fourier Transform is lost.
For example, Figure 3-2-1 shows the interconnection pattern for a
16 X 16 Butterfly Switch constructed from base-4 switch nodes.
For comparison, Figure 3-2-2 shows a 16 X 16 Butterfly Switch
using base-2 switch nodes.

From these two figures, it is obvious that the switch with
the higher base has fewer nodes and fewer wires. We can quantify
the impact of the selection of a base on the size and structure
of the switch in hopes that this will lead us to an optimum base
for the Butterfly Switches we will construct. These calculations
assume that the number of ports is an integral power of the base
of the switch. Other numbers of inputs and outputs will be
discussed later. The calculations count the number of input port
wires but not the number of output port wires. The difference is
N wires.


Figure 3-2-2: A Butterfly Switch of Base 2

Let

S = Number of switch nodes
N = Number of ports into the switch
B = Base of the switch
C = Number of columns
W = Number of interconnections

Then,

C = LOG[Base B](N)

S = (N/B) * LOG[Base B](N)

W = B * S
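As a worked check of these three formulas, the following Python sketch computes C, S, and W for a complete switch. The function name switch_size is hypothetical, and the integer loop used in place of a logarithm is simply a way to avoid floating-point rounding; neither comes from the report.

```python
def switch_size(n_ports, base):
    """Return (C, S, W) for a complete switch with n_ports ports.

    C = LOG[Base B](N), S = (N/B) * C, W = B * S.
    n_ports must be an integral power of base, as the text assumes.
    """
    if base < 2:
        raise ValueError("base must be at least 2")
    columns, size = 0, 1
    while size < n_ports:         # integer logarithm, no rounding error
        size *= base
        columns += 1
    if size != n_ports:
        raise ValueError("n_ports must be an integral power of base")
    nodes = (n_ports // base) * columns
    wires = base * nodes
    return columns, nodes, wires
```

For the two 16 X 16 switches discussed in this section, switch_size(16, 4) gives 2 columns, 8 nodes, and 32 interconnections, while switch_size(16, 2) gives 4 columns, 32 nodes, and 64 interconnections.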

If we wish to compare switches of two bases, we can derive
the ratios of the number of switch nodes and the number of wires
in two switches as follows, assuming an equal number of input
ports:

S1/S2 = (B2 * LOG(B2)) / (B1 * LOG(B1))

W1/W2 = LOG(B2) / LOG(B1)

The following table compares base-2 switches with higher
base switches. In each case, the table entry gives the
improvement provided by the higher base. For example, a base-2
switch has 12 times as many switch nodes as a base-8 switch.

BASE    S1/S2    W1/W2

  2        1        1
  4        4        2
  8       12        3
 16       32        4
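The table entries follow directly from the two ratio formulas with B1 = 2. The short Python sketch below recomputes them; the function names improvement and table are hypothetical, and base-2 logarithms are used only because the ratio is independent of the logarithm base and log2 is exact for powers of two.

```python
import math


def improvement(base, reference=2):
    """Return (S1/S2, W1/W2), comparing a reference-base switch to a higher base."""
    s_ratio = (base * math.log2(base)) / (reference * math.log2(reference))
    w_ratio = math.log2(base) / math.log2(reference)
    return s_ratio, w_ratio


def table(bases=(2, 4, 8, 16)):
    """Rows of (base, node ratio, wire ratio), rounded to whole numbers."""
    rows = []
    for b in bases:
        s_ratio, w_ratio = improvement(b)
        rows.append((b, round(s_ratio), round(w_ratio)))
    return rows
```

Calling table() yields the rows (2, 1, 1), (4, 4, 2), (8, 12, 3), and (16, 32, 4), in agreement with the table above.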

Thus  a switch  with  a larger  base  has  fewer  switch  nodes  and 
fewer  interconnections.  Switches  with  large  bases  have  a further 
advantage over those with small bases in that a single switch
node can be built as a B X B crosspoint switch. This reduces the
number of conflicts in the network, thereby increasing the
performance of the switch.

While the reduction in the number of collisions is an
important advantage of larger bases to the performance of the
network, we should also note that a higher base implies fewer
columns.  This  will  decrease  the  actual  delay  across  the  switch 
to  a small  extent. 

While  these  arguments  suggest  that  the  largest  possible  base 
should  be  selected,  there  are  problems  with  large  bases.  In 
particular, the Butterfly Switch does not grow as smoothly as we
had earlier expected. Although it is practical to make a large
range of switch sizes from the same switch node, there are rather
sharp growth points where the size of the switch must be
increased significantly to add one more port. We will discuss
these issues more in the next section.

In light of these considerations, we expect to build our
switch using base-4 nodes since it achieves a good compromise
between the complexity of the switch node and the improved
performance of the switch.

3.3 Partial Switches

As we have noted before, Butterfly Switches have certain
preferred numbers of ports which result in complete switches. By
complete, we mean that in the switch, there are no nodes which
have unused inputs or outputs. These numbers are a function of
the base. The number of ports in a complete switch is B**i where
i is an integer. If N is not an integral power of the base, then the
structure of the switch is that of a switch which is "the next
size larger": that is, one with B**i ports where i is the next
larger integer. This leaves a switch which has many unused
ports, and potentially even unused switch nodes. We can remove
the unused switch nodes from the switch network.

Figure 3.3-1 is an example of a 10 x 10 switch which has had
the unneeded switch nodes removed. The pattern of unneeded
switch nodes is much more complex in larger switches.
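The growth points described above can be illustrated with a short calculation. The following is a minimal sketch, not part of the original design; the function name `complete_switch_size` is hypothetical, and the sketch only finds the enclosing complete switch. It does not attempt to work out which unused nodes can be removed.

```python
def complete_switch_size(n_ports: int, base: int) -> tuple:
    """Return (columns, ports) of the smallest complete switch
    (ports = base**columns) that offers at least n_ports ports."""
    if n_ports < 1 or base < 2:
        raise ValueError("need n_ports >= 1 and base >= 2")
    columns, ports = 1, base
    while ports < n_ports:
        columns += 1
        ports *= base
    return columns, ports


# Adding one port beyond a power of the base forces the next size up.
for n in (10, 16, 17, 64, 65):
    columns, ports = complete_switch_size(n, base=4)
    print(f"{n:3d} ports: {columns} columns, complete size {ports}, "
          f"{ports - n} ports unused")
```

With base 4, going from 16 to 17 ports moves the enclosing complete switch from 16 ports in two columns to 64 ports in three, which is the kind of sharp growth point noted in the previous section.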

3.4 Long/Short Messages

The pipelined structure of a Butterfly Switch implies a
relatively large initial delay in setting up a transfer, followed
by a very high transfer rate. This means that the Butterfly
Switch favors messages which are long over those which are short.
The structure of a tightly coupled multiprocessor, however,
requires the ability to read and write single words across the
switch. As a result, we have decided that the hardware will
support both single word transfers and multiple word transfers.
This decision is a matter of application of the switch, since the
switch need not discriminate between these two message types.


Report No. 4143                    Bolt Beranek and Newman Inc.


one  path  through  the  switch  would  be  constantly  occupied.  We 
would  expect  this  to  block  many  short  messages  and  stop  many 
other  processors. 


The simplest solution to this is to divide long transfers
into shorter pieces of, say, 16 words. This has four further
advantages:


1. The switch interface implementation is simplified
because  the  messages  can  be  assembled  in  hardware 
buffers.  This  also  decouples  the  switch  rate  and  the 
processor  node  rate. 

2.  With  an  appropriate  switch  interface  design,  the  local 
processor  can  access  remote  memory  locations  during  the 
transfer  of  a long  data  block. 

3.  Shorter  messages  improve  throughput  when  switch 
transmission  errors  occur. 

4. Should the processor need the full bandwidth of the
memory for fast interrupt servicing, the sectioning of
long data blocks into short data blocks may allow the
processor to temporarily stop a data transfer between
the small blocks during the interrupt service routine.
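The division of a long transfer into short blocks can be sketched as follows. This is only an illustration of the idea: the name `split_transfer` is hypothetical, a Python list stands in for a block of memory words, and the 16-word block size is the example value given above.

```python
from typing import Iterator, List

BLOCK_WORDS = 16  # example block size from the text


def split_transfer(words: List[int],
                   block_words: int = BLOCK_WORDS) -> Iterator[List[int]]:
    """Yield successive blocks of at most block_words words.

    Each block would travel as a separate message, so the path through
    the switch is released between blocks.
    """
    if block_words < 1:
        raise ValueError("block_words must be positive")
    for start in range(0, len(words), block_words):
        yield words[start:start + block_words]


# A 40-word transfer becomes two full blocks and one partial block.
blocks = list(split_transfer(list(range(40))))
print([len(block) for block in blocks])
```

Between any two blocks the interface is free to carry a single word reference or to pause for an interrupt, which is the basis of advantages 2 and 4.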


3.5 Speed Issues


For the Voice Funnel application an important speed issue is
the bandwidth of a Butterfly Switch. This bandwidth is a
function of five factors:


1. Switch thickness - The bandwidth is linearly related to
how many bits move across the switch during one clock
period.

2.  Switch  clock  frequency  - The  bandwidth  is  linearly 
related  to  the  switch  clock  frequency. 
3. Control overhead - As each message consists of data bits
and control bits, the bandwidth is related to the ratio
of data bits to message bits (i.e., data bits plus
control bits).

4.  Memory  bandwidth  - At  some  operating  point  of  the  above 
four  factors,  the  data  cannot  be  written  into  or  read 
from  the  local  memory  fast  enough  to  keep  the  switch 
path  busy  and  still  permit  the  local  processor  an 
occasional  access  to  its  memory. 

The switch clock frequency is limited by the time it takes
to establish a connection between a switch node input and output
port or by the time it takes to propagate the data to the next
switch node. In very large switch configurations, the
propagation time may be larger than the connection time. We have
estimated the maximum clock rate using standard TTL Schottky MSI
components. The worst case connection time is 77 nsec and the
worst case propagation time is 40 nsec. Because of the 4 MHz
maximum clock frequency of the microprocessor, a 12 MHz switch
clock frequency seems to be a good choice.
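The arithmetic behind this choice can be checked with a brief sketch. It assumes, as one reading of the paragraph above, that the clock period must cover the longer of the two worst case times and that the switch clock is taken as a whole multiple of the 4 MHz processor clock. The function names are hypothetical.

```python
CONNECTION_NS = 77.0    # worst case connection time from the text
PROPAGATION_NS = 40.0   # worst case propagation time from the text
PROCESSOR_MHZ = 4.0     # maximum microprocessor clock from the text


def max_switch_clock_mhz(connection_ns: float, propagation_ns: float) -> float:
    """Upper bound on the clock rate: the period must cover the slower step."""
    return 1000.0 / max(connection_ns, propagation_ns)


def largest_multiple_at_most(limit_mhz: float, step_mhz: float) -> float:
    """Largest whole multiple of step_mhz that does not exceed limit_mhz
    (the whole-multiple rule is an assumption of this sketch)."""
    return step_mhz * int(limit_mhz // step_mhz)


limit = max_switch_clock_mhz(CONNECTION_NS, PROPAGATION_NS)
choice = largest_multiple_at_most(limit, PROCESSOR_MHZ)
print(f"limit {limit:.2f} MHz, chosen clock {choice:.0f} MHz")
```

The 77 nsec connection time bounds the clock at just under 13 MHz, so 12 MHz is the highest multiple of 4 MHz that fits.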

Because not every switch clock period transfers data, the
potential bandwidth of 48 MHz is further reduced. For the
unidirectional switch, the table below shows the reduction in
bandwidth due to the inclusion of control bits in each message
for several data block sizes (Base-4, 125 processors, thickness
of 4).

Data      Message     Effective
Bits      Size        Bandwidth (MHz)

  16         96           8
  32        112          13.7
  64        144          21.3
 256        336          36.6

As can be seen, sending or retrieving data from remote memory
in multiple word blocks makes much better use of the Butterfly
Switch.
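The tabulated figures can be reproduced with a short calculation. This is a sketch under two assumptions: the raw bandwidth is the 12 MHz clock times a thickness of 4, and the control overhead is 80 bits per message. The 80-bit figure is inferred from the table (message size minus data bits in every row), not stated in the text.

```python
CLOCK_MHZ = 12      # switch clock from the text
THICKNESS = 4       # bits moved per clock period
CONTROL_BITS = 80   # inferred from the table: message size minus data bits


def effective_bandwidth_mhz(data_bits: int,
                            control_bits: int = CONTROL_BITS,
                            clock_mhz: int = CLOCK_MHZ,
                            thickness: int = THICKNESS) -> float:
    """Raw bandwidth scaled by the fraction of message bits that are data."""
    message_bits = data_bits + control_bits
    return clock_mhz * thickness * data_bits / message_bits


for data_bits in (16, 32, 64, 256):
    print(data_bits, data_bits + CONTROL_BITS,
          round(effective_bandwidth_mhz(data_bits), 1))
```

Because the control overhead is fixed per message, the data fraction rises from 1/6 for a single 16-bit word to about 3/4 for a 256-bit block.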


3.6  Deadlocks 


Consider a switch interface having a single input message
buffer and a single output message buffer. The deadlock shown in
Figure 3.6-1 can be imagined as resulting from the following
scenario: 1) both Node A and Node B send memory "request"
messages to each other simultaneously; 2) after these requests
have been processed, but before the replies have been sent,
requests from two other nodes (X and Y) arrive at the inputs. At
this point, the "replies" in Node A and Node B cannot be sent
because the receive buffers are occupied by the foreign requests.
Yet no further processing of requests can be done to free the
receive buffers until the replies have been sent -- the system is
locked up.


However, if we can assure that a reply message can always be
sent and accepted, the system is free of deadlocks. This guiding
principle will help us understand the requirements on the design
of the switch and the switch interface which are necessary to
assure deadlock-free operation. To validate these ideas, the
simulations which we will be describing later have implemented
these mechanisms and have not exhibited deadlocks.

Figure 3.6-1 A Deadlock
If we are to assure that replies are always accepted, we
must provide a mechanism which accepts them even if the receiver
buffer is occupied. To assure this, we provide two buffers: one
for requests and one for replies. This is not enough; we must
also have a way to peek at each message and turn requests away
when the request buffer is occupied so that replies can always be
accepted at the entry port to the processor node. This function
can be performed by first accepting the header of the message and
then asserting the REJECT signal if it is a request and the
request buffer is busy.

Note  that  it  is  this  requirement  which  leads  to  deadlocks 
when  the  Wait  strategy  is  applied  to  the  one  way  switch.  The 
Wait  strategy  has  no  mechanism  for  turning  away  requests  and 
letting  replies  through. 

If  we  are  to  assure  the  path  of  a reply,  we  must  also  assure 
that  it  can  be  transmitted  in  the  first  place.  The 
straightforward  way  to  do  this  is  to  provide  two  transmit  buffers 
and  allocate  one  for  replies  only.  Then,  when  a request  is 
rejected,  before  retransmitting  the  request,  the  transmitter 
should  check  the  reply  buffer  and  if  a reply  is  present,  send  it 
instead  of  the  request. 
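The receive and transmit rules described above can be sketched as a small model. This is an illustration only, not the hardware design: the class and function names are hypothetical, and the model assumes the processor consumes received replies promptly, so the reply side never fills.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Message:
    kind: str     # "request" or "reply"
    source: str   # name of the sending node (illustrative only)


@dataclass
class SwitchInterface:
    """Hypothetical model of one processor node's switch interface."""
    rx_request: Optional[Message] = None   # single receive buffer for requests
    rx_replies: List[Message] = field(default_factory=list)  # assumed drained promptly
    tx_request: Optional[Message] = None   # transmit buffer for requests
    tx_reply: Optional[Message] = None     # transmit buffer reserved for replies

    def accept(self, msg: Message) -> bool:
        """Inspect the header; return False (REJECT) only for a request
        that arrives while the request buffer is busy."""
        if msg.kind == "reply":
            self.rx_replies.append(msg)
            return True
        if self.rx_request is not None:
            return False
        self.rx_request = msg
        return True

    def select_for_transmit(self) -> Optional[Message]:
        """A waiting reply always goes ahead of a (re)transmitted request."""
        return self.tx_reply if self.tx_reply is not None else self.tx_request


def try_send(sender: SwitchInterface, receiver: SwitchInterface) -> Optional[str]:
    """Attempt one transmission; return the kind delivered, or None."""
    msg = sender.select_for_transmit()
    if msg is None or not receiver.accept(msg):
        return None
    if msg.kind == "reply":
        sender.tx_reply = None
    else:
        sender.tx_request = None
    return msg.kind


# The scenario of Section 3.6: A and B each hold a foreign request and
# each owe the other a reply. With separate reply buffers both get through.
a = SwitchInterface(rx_request=Message("request", "X"), tx_reply=Message("reply", "A"))
b = SwitchInterface(rx_request=Message("request", "Y"), tx_reply=Message("reply", "B"))
print(try_send(a, b), try_send(b, a))
```

A third node that tries to send a request to A in this state is rejected and keeps the request in its transmit buffer for a later retry, while any reply it owes would be selected first.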

Finally, we are counting on randomness in the switch to
assure that replies do not repeatedly encounter the same or other
request messages in the switch, which would also lead to
deadlocks. This is a level of detail which we have not yet
explored.

3.7  Error  Control 

We have used the analogy of the Butterfly Switch as a
communications network before. As we know, one of the important
characteristics of a communications network is its handling of
errors. We expect two things of such a network: 1) extremely low
probability of undetected errors, and 2) automatic recovery from
errors.

Although the Butterfly Switch is within a computer system,
we should still expect errors to occur -- if not because of
noise, then because of either intermittent or solid component
failures. We have discussed the introduction of extra columns in
the switch to provide extra paths which will permit operation
once failed components have been identified. In order to detect
errors when they occur, we expect to provide check bits on each
transaction.

Errors in the data of a message will be detected by these
check bits, but errors in addressing will not be. A simple way
to detect addressing errors is to include the destination address
in the checksum when the message is sent and to have the
destination processor node include its own address when it
verifies the checksum. Thus, if the message reaches the correct
destination without error, the checksum will be correct.
Otherwise, an error will be detected. This method has the
advantage of avoiding the transmission of the destination address
in the text of the message.
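The address-in-checksum idea can be sketched as follows. The text does not specify the check bit algorithm, so this sketch assumes a simple 16-bit additive checksum purely for illustration. The function names are hypothetical.

```python
from typing import List, Tuple

MASK = 0xFFFF  # 16-bit checksum width, an assumption of this sketch


def checksum(words: List[int], address: int) -> int:
    """Additive checksum over the data words, seeded with a node address."""
    total = address & MASK
    for word in words:
        total = (total + word) & MASK
    return total


def send(words: List[int], dest_address: int) -> Tuple[List[int], int]:
    """Build a message; the destination address is folded into the
    checksum but is not carried in the message text."""
    return list(words), checksum(words, dest_address)


def verify(message: Tuple[List[int], int], own_address: int) -> bool:
    """The receiver recomputes the checksum using its own address."""
    words, check = message
    return checksum(words, own_address) == check


message = send([0x1234, 0x00FF, 0x8000], dest_address=5)
print(verify(message, own_address=5))   # delivered to the intended node
print(verify(message, own_address=6))   # misrouted to a different node
```

A message that arrives at the wrong node fails verification there in the same way as a message with corrupted data, so one check covers both kinds of error.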

3.8  Flow  Control 

Another characteristic which we have come to expect of
communications networks is a facility for flow control. This is
necessary whenever we cannot guarantee that the receiver has
sufficient resources to accept what the transmitter may wish to
send. Our initial design for the switch interface implies a very
structured buffering arrangement with few message format
variations. As a result, our initial design would not need flow
control in the switch.


3.9 Current Switch Design

In summary, we felt that the best switch design for the
range of machines currently anticipated is as follows:

1. Conflict Resolution - We will use the "retreat"
strategy.

2. Parallel Data Paths - Data parallelism of 4 bits (called
5. Long/Short Messages - Both long messages (multiple word
transfers)  and  short  messages  (single  word  transfers) 
will  be  supported.  To  keep  the  latency  of  short 
messages  low,  long  messages  will  be  broken  into  blocks 
of  16  words  or  less. 

6. Speed - We expect the switch to clock at between 10 and
12 MHz for an effective bandwidth of 40 to 48 MHz per
path. In a switch with 256 ports, this implies a
maximum aggregate data rate of 10 to 12 gigahertz.

7. Deadlocks - We will avoid deadlocks by using the
"retreat" strategy in the switch, and by designing the
switch interface so that it can always accept a reply.

8.  Error  Control  - The  switch  itself  will  not  perform  any 
error  control,  but  the  switch  interfaces  will  use  check 
bits  to  detect  errors  in  data  or  routing. 


9. Flow Control - None in the switch. The sender and
receiver will be designed so that the transactions do
not need flow control.
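The aggregate figure quoted in item 6 can be checked with a brief sketch. It assumes that every one of the 256 paths is busy at once with no conflicts, which is why the result is a maximum, and it follows the report in quoting the rate in hertz. The function name is hypothetical.

```python
def aggregate_rate_ghz(ports: int, clock_mhz: float, thickness: int = 4) -> float:
    """Maximum aggregate rate: every path carries thickness bits per clock."""
    return ports * clock_mhz * thickness / 1000.0


for clock_mhz in (10, 12):
    print(clock_mhz, "MHz clock:", aggregate_rate_ghz(256, clock_mhz))
```

The results, 10.24 and 12.288, round to the 10 to 12 range given in the summary.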